@HyukjinKwon (Member)
### What changes were proposed in this pull request?

This PR fixes the regression introduced by #36683.

```python
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0)
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
```

**Before**

```
/.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false.
  range() arg 3 must not be zero
  warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame
    return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow
    pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step))
ValueError: range() arg 3 must not be zero
```
```
Empty DataFrame
Columns: [a]
Index: []
```

**After**

```
     a
0  123
```
```
     a
0  123
```
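The traceback shows the root cause: `_create_from_pandas_with_arrow` passes the configured batch size straight into `range(0, len(pdf), step)`, so a zero value raises a `ValueError` and a negative value silently yields no slices (the empty DataFrame above). A minimal sketch of the kind of guard that restores the documented behaviour (non-positive means "no limit", i.e. one batch); `max_records_per_batch` is an illustrative stand-in for the config value, not the exact patch:

```python
import pandas as pd

# Illustrative stand-in for spark.sql.execution.arrow.maxRecordsPerBatch;
# 0 and -1 are the regressed cases.
max_records_per_batch = 0

pdf = pd.DataFrame({"a": [123]})

step = max_records_per_batch
if step <= 0:
    # Non-positive means "no limit": put the whole DataFrame in one slice.
    # max(..., 1) also keeps range() from getting a zero step when pdf is empty.
    step = max(len(pdf), 1)

pdf_slices = [pdf.iloc[start : start + step] for start in range(0, len(pdf), step)]
print(len(pdf_slices))  # 1
```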

### Why are the changes needed?

It fixes a regression: non-positive values of `spark.sql.execution.arrow.maxRecordsPerBatch` are documented behaviour and must keep working. The fix should be backported to branch-3.4 and branch-3.5.

### Does this PR introduce any user-facing change?

Yes, it fixes a regression as described above.

### How was this patch tested?

A unit test was added.
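For reference, a regression test covering both values could look like the sketch below; the function name and the `spark` session fixture are illustrative, not the test that was actually merged:

```python
import pandas as pd

def test_create_dataframe_nonpositive_max_records_per_batch(spark):
    # `spark` is assumed to be an active SparkSession (e.g. a test fixture).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
    for value in (0, -1):
        spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", value)
        # Both values must round-trip through Arrow without raising.
        result = spark.createDataFrame(pd.DataFrame({"a": [123]})).toPandas()
        assert result["a"].tolist() == [123]
```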

### Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon changed the title Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch Feb 16, 2024
@dongjoon-hyun (Member) left a comment
+1, LGTM (Pending CIs)

@HyukjinKwon (Member, Author)
Merged to master, branch-3.5 and branch-3.4.

HyukjinKwon added a commit that referenced this pull request Feb 16, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes #45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 16, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes #45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024

[SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch

Closes apache#45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3bb762d)
Signed-off-by: Hyukjin Kwon <[email protected]>